1 Introduction

We consider the problem of detecting the dependence (homology) of two sequences of comparable
length over a finite alphabet. The classical similarity measures in sequence comparison are
based on the score of an optimal alignment of the sequences of interest (see, e.g.,
[2], [3], [4], [5], [6]). The optimal alignment is in general not unique, but all optimal
alignments have the same score. Hence, the differences between the various optimal alignments
are not taken into account. Our method is based on the observation that for many scoring
schemes, especially for longest common subsequence (LCS) scoring, the less related the
sequences are, the more the optimal alignments differ. This suggests using the variety of the
optimal alignments as an additional measure of homology. A (partial) theoretical justification
of this idea is given in [1], where the differences between LCS-optimal alignments were
measured in terms of distances between so-called extremal alignments. The main result of [1]
states that for related sequences (in a certain sense), the distance between the extremal
alignments is of order $\ln n$, where $n$ is the common length of the two sequences. It has
not been proven that for independent sequences the distance between extremal alignments grows
faster than $\ln n$, but the simulations in [1] show that this is indeed the case. Hence, the
distance between extremal alignments could provide important information about the similarity
of the sequences. To test this idea on actual DNA sequences, in this note we apply it to four
biologically similar genes, and we compare our test results with the output of the commonly
used BLAST program (see [2]).

The paper is organized as follows. In the next section, we briefly explain the setup and
main results of [1]. In Section 3, the results of the case study are presented.

2 Theoretical background 
2.1 Extremal alignments 

Let $\mathcal{A}$ be a finite alphabet. In the context of DNA sequences, $\mathcal{A}$
consists of four letters. In everything that follows, $X = X_1 \ldots X_n \in \mathcal{A}^n$
and $Y = Y_1 \ldots Y_n \in \mathcal{A}^n$ are two strings of length $n$. A common subsequence
of $X$ and $Y$ is a sequence that is a subsequence of $X$ and at the same time of $Y$. We
denote by $L_n$ the length of the longest common subsequence (LCS) of $X$ and $Y$. The length
of the LCS is clearly an important tool in sequence comparison: the bigger $L_n$ is, the more
related the sequences are presumed to be. It is well known that for ergodic sequences (in
particular, for independent iid sequences), the relative length of the LCS converges to a
constant, i.e.

$$\frac{L_n}{n} \to \gamma, \quad \text{a.s.,} \qquad (2.1)$$

where $\gamma$ is the so-called Chvátal-Sankoff constant. Unfortunately, the constant $\gamma$
is not exactly known even for cases as simple as i.i.d. Bernoulli sequences. Moreover, neither
the speed of convergence nor the corresponding central limit law is known. All this makes it
very difficult to use $L_n$ for testing independence and motivates us to look for alternative
LCS-related criteria.
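The length $L_n$ itself can be computed by the classical dynamic-programming recursion. The following is a minimal Python sketch of that recursion, not code from [1]; the function name `lcs_length` and the two short test strings are illustrative choices.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y.

    Row-by-row dynamic programming, O(len(x) * len(y)) time and O(len(y)) memory.
    """
    prev = [0] * (len(y) + 1)  # LCS lengths of the previous prefix of x against prefixes of y
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence by a == b
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop the last letter of x or of y
        prev = curr
    return prev[-1]


print(lcs_length("ATAGCGT", "CAACATG"))  # prints 4
```

Dividing the returned value by the sequence length gives the relative LCS length that appears on the left-hand side of (2.1).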

We now explain the idea of extremal alignments. Recall that $X$ and $Y$ are both sequences
of length $n$. Suppose there exist two subsets of indices
$\{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}$ and $\{j_1, \ldots, j_k\} \subset \{1, \ldots, n\}$
satisfying $i_1 < i_2 < \cdots < i_k$, $j_1 < j_2 < \cdots < j_k$ and
$X_{i_1} = Y_{j_1}, X_{i_2} = Y_{j_2}, \ldots, X_{i_k} = Y_{j_k}$. Then $X_{i_1} \cdots X_{i_k}$
is a common subsequence of $X$ and $Y$, and the pairs

$$\{(i_1, j_1), \ldots, (i_k, j_k)\} \qquad (2.2)$$

are the corresponding alignment. Thus $L_n$ is the biggest $k$ for which such subsets of
indices exist, and any alignment corresponding to a longest common subsequence is called
optimal. We consider every (optimal) alignment (2.2) as a set of points in
$\{1, \ldots, n\} \times \{1, \ldots, n\}$ and call it the two-dimensional representation of
the (optimal) alignment. Note that different optimal alignments can yield the same common
subsequence, and usually there is more than one longest common subsequence. Hence, typically,
there are many optimal alignments represented in $\{1, \ldots, n\} \times \{1, \ldots, n\}$.
The notion of extremal alignments is best explained through the following example.

Example: Let $X =$ ATAGCGT, $Y =$ CAACATG. There are two longest common subsequences: AACG
and AACT. Thus $L_7 = 4$. To each longest common subsequence there correspond two alignments
of the form (2.2): the alignments (1,2), (2,3), (5,5), (6,7) and (1,2), (2,3), (5,4), (6,7)
correspond to AACG; the alignments (1,2), (3,3), (5,4), (7,6) and (1,2), (3,3), (4,4), (7,6)
correspond to AACT. The corresponding two-dimensional graphs are:

[Figure: two-dimensional graphs of the four alignments]

Putting all four alignments into one graph, we see that in some regions the alignment is
unique, whereas in other regions the alignments vary:

[Figure: all four alignments in one graph]

In the picture above, the two black dots together with the red dots correspond to the
alignment that lies above all others. This alignment is called the highest alignment.
Similarly, the two black dots together with the blue dots correspond to the alignment that
lies below all others. This alignment is called the lowest alignment. Thus, in our example,
the alignment (1,2), (2,3), (5,4), (6,7) (corresponding to AACG) is the highest and the
alignment (1,2), (3,3), (5,4), (7,6) (corresponding to AACT) is the lowest. The highest and
lowest alignments are called extremal alignments.

Thus, the highest (lowest) alignment is the one that lies above (below) all other alignments
in the two-dimensional representation. The formal definition of the extremal alignments, as
well as the proof that the definition is correct, can be found in [1]. For big $n$, we usually
connect the dots in the two-dimensional representation by lines. Figure 1, taken from [1],
shows the extremal alignments (red) of two independent iid sequences of length $n = 1000$. It
is visible that the extremal alignments are rather far from each other; in particular, the
maximum vertical and horizontal distances are relatively big.
[Plot title: 4 color sequences x and y, score LCS=645, Hausdorff distance=33, maximum vertical
distance=73 (blue), maximum horizontal distance=70 (magenta), maximum horizontal length=458 (black)]

Figure 1: The extremal alignments of two independent iid four letter sequences
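One way to obtain two optimal alignments that bracket the others is to trace back through the LCS table with a fixed tie-breaking rule. The Python sketch below illustrates this under stated assumptions and is not the formal construction of [1]: the names `lcs_table` and `extremal_alignment` are hypothetical, and it is assumed (not proven here) that always skipping a letter of $X$ first yields the highest alignment, while always skipping a letter of $Y$ first yields the lowest one. Indices are 1-based, as in (2.2).

```python
def lcs_table(x: str, y: str):
    """Table t with t[i][j] = LCS length of the prefixes x[:i] and y[:j]."""
    t = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def extremal_alignment(x: str, y: str, highest: bool = True):
    """Optimal alignment found by a tie-broken traceback from the corner (len(x), len(y)).

    highest=True  : prefer leaving X_i unaligned (move left), keeping j large.
    highest=False : prefer leaving Y_j unaligned (move down), keeping i large.
    Returns a list of 1-based index pairs (i, j) with x[i-1] == y[j-1].
    """
    t = lcs_table(x, y)
    i, j = len(x), len(y)
    pairs = []
    while i > 0 and j > 0:
        skip_x = t[i - 1][j] == t[i][j]  # X_i is not needed for an optimal score
        skip_y = t[i][j - 1] == t[i][j]  # Y_j is not needed for an optimal score
        if highest and skip_x:
            i -= 1
        elif highest and skip_y:
            j -= 1
        elif not highest and skip_y:
            j -= 1
        elif not highest and skip_x:
            i -= 1
        else:
            # Neither letter can be skipped, so X_i == Y_j and the pair is aligned.
            pairs.append((i, j))
            i -= 1
            j -= 1
    return pairs[::-1]


if __name__ == "__main__":
    print(extremal_alignment("AB", "BA", highest=True))   # [(1, 2)]
    print(extremal_alignment("AB", "BA", highest=False))  # [(2, 1)]
```

Both tracebacks return alignments of length $L_n$; they differ only in which of the available ties they resolve first, which is why they coincide wherever the optimal alignment is unique.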



2.2 Related sequences 

Unrelated sequences $X$ and $Y$ are independent. In our setup, relatedness is based on the
assumption that there exists a common ancestor from which both sequences $X$ and $Y$ are
obtained by independent random mutations and deletions. In the following, the common ancestor
is an $\mathcal{A}$-valued iid process $Z_1, Z_2, \ldots$. A letter $Z_i$ mutates according to
a transition matrix that does not depend on $i$. Hence, a mutation of the letter $Z_i$ can be
formalized as $f(Z_i, \xi_i)$, where $f : \mathcal{A} \times \mathbb{R} \to \mathcal{A}$ is a
mapping and $\xi_i$ is a uniformly distributed random variable. The mapping
$f_i(\cdot) := f(\cdot, \xi_i)$ from $\mathcal{A}$ to $\mathcal{A}$ will be referred to as the
random mapping. The mutations of the letters are assumed to be independent. This means that
the random variables $\xi_1, \xi_2, \ldots$, or the random mappings $f_1, f_2, \ldots$, are
independent (and identically distributed). After the mutations, the sequence is
$f_1(Z_1), f_2(Z_2), \ldots$. Some of its elements disappear. This is modeled via a deletion
process $D^x_1, D^x_2, \ldots$ that is assumed to be an iid Bernoulli sequence with parameter
$p$, i.e. $P(D^x_i = 1) = p$. If $D^x_i = 0$, then $f_i(Z_i)$ is deleted. The resulting
sequence, call it $X$, is therefore the following: $X_i = f_j(Z_j)$ if and only if
$D^x_j = 1$ and $\sum_{k=1}^{j} D^x_k = i$. Similarly, the sequence $Y$ is obtained from $Z$.
For the mutations, fix an iid uniformly distributed sequence $\eta_1, \eta_2, \ldots$ so that
the mutated sequence is $h_1(Z_1), h_2(Z_2), \ldots$ with $h_i(\cdot) := f(\cdot, \eta_i)$.
Note that the transition matrix corresponding to the $Y$-mutations equals the one corresponding
to the $X$-mutations, implying that the random mappings $h_i$ and $f_i$ have the same
distribution. Since the mutations of $X$ and $Y$ are supposed to be independent, we assume
that the sequences $\xi$ and $\eta$, or the sequences of random mappings $f_1, f_2, \ldots$
and $h_1, h_2, \ldots$, are independent. Note that then the pairs
$(f_1(Z_1), h_1(Z_1)), (f_2(Z_2), h_2(Z_2)), \ldots$ are independent, but $f_i(Z_i)$ and
$h_i(Z_i)$, in general, are not. Finally, some of the elements of $h_1(Z_1), h_2(Z_2), \ldots$
are deleted according to a deletion process $D^y_1, D^y_2, \ldots$ consisting of iid Bernoulli
random variables with the same parameter as $D^x$ but independent of $D^x$. The remaining
elements define the $Y$-sequence. Note that our definition of relatedness includes independent
sequences as a special case, namely when the function $f$ does not depend on $Z$.
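The mutation-and-deletion mechanism described above can be simulated directly. The following Python sketch is only an illustration under explicit assumptions: the concrete mutation map `mutate` (with probability `mu` a letter is redrawn uniformly from the alphabet) is a hypothetical choice of $f$, since the text only requires some fixed transition matrix, and the names `related_pair`, `descendant`, `p_keep` and `mu` are not from [1]. The parameter `p_keep` plays the role of $p = P(D_i = 1)$.

```python
import random

ALPHABET = "ACGT"


def mutate(letter: str, u: float, mu: float) -> str:
    """Hypothetical mutation map f(letter, u) driven by a uniform number u in [0, 1).

    If u < mu the letter is redrawn uniformly from ALPHABET (possibly to itself);
    otherwise it is left unchanged.
    """
    if u >= mu:
        return letter
    k = min(int(u / mu * len(ALPHABET)), len(ALPHABET) - 1)
    return ALPHABET[k]


def descendant(ancestor: str, p_keep: float, mu: float, rng: random.Random) -> str:
    """Mutate every ancestor letter independently, then delete it unless D_i = 1."""
    out = []
    for z in ancestor:
        mutated = mutate(z, rng.random(), mu)  # f_i(Z_i) = f(Z_i, xi_i)
        if rng.random() < p_keep:              # D_i = 1 with probability p_keep
            out.append(mutated)
    return "".join(out)


def related_pair(n: int, p_keep: float = 0.9, mu: float = 0.2, seed: int = 0):
    """Return (Z, X, Y): an iid ancestor of length n and two independent descendants."""
    rng = random.Random(seed)
    z = "".join(rng.choice(ALPHABET) for _ in range(n))
    x = descendant(z, p_keep, mu, rng)
    y = descendant(z, p_keep, mu, rng)
    return z, x, y


if __name__ == "__main__":
    z, x, y = related_pair(30)
    print(z, x, y, sep="\n")
```

Note that in this sketch $X$ and $Y$ generally have different lengths, both at most $n$, because the two deletion processes are independent.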

Example: The following table illustrates the generic process of obtaining X and Y.

$Z$:      $Z_1$        $Z_2$        $Z_3$        $Z_4$        $Z_5$        $Z_6$        common ancestor
$f(Z)$:   $f_1(Z_1)$   $f_2(Z_2)$   $f_3(Z_3)$   $f_4(Z_4)$   $f_5(Z_5)$   $f_6(Z_6)$   X mutations
$D^x$:                 1            1                                      1            X deletions
$X$:                   $X_1$        $X_2$                                  $X_3$
$h(Z)$:   $h_1(Z_1)$   $h_2(Z_2)$   $h_3(Z_3)$   $h_4(Z_4)$   $h_5(Z_5)$   $h_6(Z_6)$   Y mutations
$D^y$:    1            1            1                         1                         Y deletions
$Y$:      $Y_1$        $Y_2$        $Y_3$                     $Y_4$

[Plot title: 4 color sequences x and y, score LCS=747, Hausdorff distance=3, maximum vertical
distance=6 (blue), maximum horizontal distance=4.5 (magenta), maximum horizontal length=26 (black);
horizontal axis: x-sequence of length 949]

Figure 2: The extremal alignments of two related four letter sequences

In [1], related sequences were simulated and the corresponding extremal alignments were
found. Figure 2 presents a typical picture of the extremal alignments of two related sequences
of length 1000. Clearly the extremal alignments are close to each other; in particular, the
maximal vertical and horizontal distances are much smaller than those in Figure 1. The
closeness of the extremal alignments of related sequences follows from the main result of [1].
Before we state the result formally, some notation needs to be introduced. Let $\gamma_R$ be
the limit in (2.1) when $X$ and $Y$ are related. Typically $\gamma_R > \gamma$, where $\gamma$
is the corresponding limit for independent sequences with the same laws. The existence of
$\gamma_R$ is proven in [1]. Let

p(a) := P{Xi = a), q=l- minp(a), p a = P{Xi = YA = VVa) 2 , 

a ' J 

a 

p := ma.xp(a), q := 1 — min P(Xi = a\Yi = b), p:=^^-. 
aeA 0,6 pq 

When $X$ and $Y$ are independent, $\bar{q} = q$. The following theorem is the main result
of [1]. Below, it is stated for the vertical distance, but it also holds for the horizontal
distance. The condition (2.3) postulates the relatedness. It can be shown that for independent
or only slightly related sequences, (2.3) is not fulfilled. This does not by itself mean that
for independent sequences the inequality (2.4) fails, but the simulations in [1] show that
this is the case. In the following theorem, $h$ stands for the binary entropy function, i.e.,
$h(p) = -p \log_2 p - (1 - p) \log_2(1 - p)$.
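For readers who want to evaluate the entropy term numerically, here is a minimal Python sketch of the binary entropy function $h$ as defined above. The function name `binary_entropy` is an illustrative choice, and the convention $h(0) = h(1) = 0$ is the usual continuous extension, stated here as an assumption.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) = -p*log2(p) - (1-p)*log2(1-p), with h(0) = h(1) = 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # limit value; avoids log2(0)
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


print(binary_entropy(0.5))  # prints 1.0
```

The function is symmetric around $p = 1/2$, where it attains its maximum value 1.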

Theorem 2.1 Let $X$ and $Y$ be related. Assume

$$\gamma_R \log_2 \bar{p} + (1 - \gamma_R) \log_2(q \bar{q}) + \big((1 - \gamma_R) \wedge \gamma_R\big) \log_2(\rho \vee 1) + 2h(\gamma_R) < 0. \qquad (2.3)$$

Then there exist constants $C < \infty$ and $D < \infty$ such that for $n$ big enough,

$$P(V_n > C \ln n) \le D n^{-2}, \qquad (2.4)$$

where $V_n$ is the maximal vertical distance between the extremal alignments.

3 The case study 

Based on Theorem 2.1 as well as on the simulation study, we conjecture that properties of the
extremal alignments could be used as a measure of relatedness. In particular, the maximal
horizontal and vertical distances between the extremal alignments might be a good measure.
Also, as one can see from Figures 1 and 2 (and from other similar simulations), for
independent sequences there are relatively long intervals where the extremal alignments do
not coincide. In Figure 1, the biggest such interval has length 458 (the end of that interval
is marked with *). This interval is called the maximal non-uniqueness stretch, and we
conjecture that its length can be a good measure of homology as well. A related criterion is
the number of points where the extremal alignments coincide. We call it the number of
uniqueness points. Clearly, in Figure 1 the number of uniqueness points is relatively small
in comparison with Figure 2.
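Given the two extremal alignments as lists of index pairs, the last two criteria are easy to compute. The Python sketch below rests on assumed conventions that may differ in detail from those of [1]: a uniqueness point is a pair contained in both alignments, and a non-uniqueness stretch is measured horizontally, as the difference of the first coordinates of two consecutive uniqueness points between which the alignments differ, with the corners $(0,0)$ and $(n+1,n+1)$ treated as virtual uniqueness points. The function names are hypothetical. The input used for illustration is the highest and lowest alignment listed in the example of Section 2.1.

```python
def uniqueness_points(highest, lowest):
    """Index pairs (i, j) that belong to both extremal alignments."""
    return sorted(set(highest) & set(lowest))


def max_non_uniqueness_stretch(highest, lowest, n):
    """Longest horizontal gap between consecutive uniqueness points where the alignments differ.

    The corners (0, 0) and (n + 1, n + 1) act as virtual uniqueness points (an assumption).
    """
    common = [(0, 0)] + uniqueness_points(highest, lowest) + [(n + 1, n + 1)]
    differing = set(highest) ^ set(lowest)  # points lying in exactly one of the alignments
    best = 0
    for (x0, _), (x1, _) in zip(common, common[1:]):
        if any(x0 < x < x1 for x, _ in differing):
            best = max(best, x1 - x0)
    return best


highest = [(1, 2), (2, 3), (5, 4), (6, 7)]
lowest = [(1, 2), (3, 3), (5, 4), (7, 6)]
print(len(uniqueness_points(highest, lowest)))          # prints 2
print(max_non_uniqueness_stretch(highest, lowest, 7))   # prints 4
```

When the two alignments are identical, every point is a uniqueness point and the stretch computed this way is 0.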
We studied four bacterial genes of comparable length (about 1500 letters). The gene is dnaA,
taken from the bacteria Pseudomonas putida F1 (Gene 1), Pseudomonas syringae pv. syringae
B728a (Gene 2), Escherichia coli E24377A (Gene 3) and Erwinia carotovora subsp. atroseptica
SCRI1043 (Gene 4). The corresponding DNA sequences can be found in the Appendix. All the genes
have the same function; therefore they are presumed to be similar. The results of the case
study are given in the following table.



Gene  Criterion       Gene 1        Gene 2        Gene 3        Gene 4

1     Max; total      2738; 2738    1521; 1521    625; 671      529; 529
      Query           100           100           71            61
      E-value                                     2e-175        1e-146
      MaxIdent        100           82            75            72
      LCS             1518          1298          1081          1055
      Vert+Hor=Sum    0+0=0         12+11=23      17+18=35      20+24=44
      non-uniq st.                  26            79            111
      uniq points     1518          1003          604           520

2     Max; total      1521; 1521    2771; 2771    668; 722      538; 592
      Query           100           100           70            69
      E-value                                                   3e-149
      MaxIdent        82            100           76            73
      LCS             1298          1536          1097          1071
      Vert+Hor=Sum    12+11=23      0+0=0         15+13=28      14+24=38
      non-uniq st.    26                          45            80
      uniq points     1003          1536          633           565

3     Max; total      625; 671      668; 722      2533; 2533    1323; 1323
      Query           76            76            100           100
      E-value         2e-175
      MaxIdent        75            76            100           81
      LCS             1081          1097          1404          1196
      Vert+Hor=Sum    17+18=35      15+13=28      0+0=0         6+6=12
      non-uniq st.    79            45                          21
      uniq points     604           633           1404          868

4     Max; total      529; 529      538; 592      1323; 1323    2522; 2522
      Query           67            76            100           100
      E-value         1e-146        2e-149
      MaxIdent        72            73            81            100
      LCS             1055          1071          1169          1398
      Vert+Hor=Sum    20+24=44      14+24=38      6+6=12        0+0=0
      non-uniq st.    111           80            21
      uniq points     520           565           868           1398
In the table, every (double) cell contains several similarity criteria for a pair of genes.
The upper part of the cell shows the standard output of the BLAST program: "Max" and "Total"
are the maximum and total scores, respectively; "Query" is the query coverage; "E-value" and
"MaxIdent" are the e-value and max-ident, respectively. All BLAST parameters were deliberately
left at their defaults. The lower part of the cell contains the criteria based on extremal
alignments: "LCS" stands for the length of the LCS, "Vert+Hor=Sum" is the sum of the maximal
vertical and horizontal distances between the extremal alignments, "non-uniq st." is the
length of the longest non-uniqueness stretch and "uniq points" is the number of uniqueness
points of the extremal alignments.

From the table, it is evident that Genes 1 and 2, as well as Genes 3 and 4, are closely
related: the maximum and total BLAST scores of the pairs (Gene 1, Gene 2) and (Gene 3, Gene 4)
are remarkably higher than those of any other pair of different genes. Note that this
difference is also well represented by the number of uniqueness points and, remarkably well,
by the length of the longest non-uniqueness stretch. Also, the sums of the maximum horizontal
and vertical distances are in full correspondence with the other criteria and measure the
degree of relatedness well. Finally and, perhaps, most importantly, note that all criteria
based on extremal alignments seem to be more sensitive to the relatedness, although the
length of the LCS also shows the similarities rather well.
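The agreement between the criteria can be checked by ranking the six pairs of different genes by each criterion. The Python sketch below uses only values copied from the table above (BLAST maximum score, sum of distances, longest non-uniqueness stretch, number of uniqueness points); the dictionary layout and the helper `ranking` are illustrative, and the ranking is merely a convenient way to read the table, not an analysis performed in [1].

```python
# Values for each unordered pair of different genes, copied from the table.
PAIRS = {
    (1, 2): {"blast_max": 1521, "dist_sum": 23, "stretch": 26, "uniq": 1003},
    (1, 3): {"blast_max": 625, "dist_sum": 35, "stretch": 79, "uniq": 604},
    (1, 4): {"blast_max": 529, "dist_sum": 44, "stretch": 111, "uniq": 520},
    (2, 3): {"blast_max": 668, "dist_sum": 28, "stretch": 45, "uniq": 633},
    (2, 4): {"blast_max": 538, "dist_sum": 38, "stretch": 80, "uniq": 565},
    (3, 4): {"blast_max": 1323, "dist_sum": 12, "stretch": 21, "uniq": 868},
}


def ranking(criterion: str, larger_is_closer: bool):
    """Gene pairs ordered from most to least related according to one criterion."""
    return sorted(PAIRS, key=lambda pair: PAIRS[pair][criterion], reverse=larger_is_closer)


for name, larger in [("blast_max", True), ("uniq", True), ("dist_sum", False), ("stretch", False)]:
    print(name, ranking(name, larger))
```

All four rankings place the pairs (1, 2) and (3, 4) ahead of the remaining four pairs, and they order those remaining pairs identically; the two distance-based criteria differ from the BLAST score only in which of the two closest pairs comes first.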
4 Appendix

Gene 1: Pseudomonas putida F1



GTGTCAGTGGAACTTTGGCAGCAGTGCGTGGAGCTTCTGCGCGATGAACTGCCTGCCCAGCAATTCAACA 

CCTGGATCCGTCCGCTACAGGTCGAAGCCGAAGGCGACGAGTTGCGCGTCTATGCGCCTAACCGTTTCGT 

TCTCGATTGGGTCAATGAAAAGTACCTGGGTCGTTTGCTCGAGCTGTTGGGTGAGAACGGTAGCGGCATT 

GCACCAGCCCTTTCCTTATTAATAGGTAGCCGCCGCAGCTCGGCCCCAAGGGCTGCACCCAACGCGCCGG 

TCAGCGCTGCCGTTGCGGCTTCGCTGGCGCAGACTCAGGCGCACAAGACGGCCCCGGCAGCAGCGGTTGA 

ACCCGTTGCCGTGGCCGCGGCCGAGCCTGTATTGGTCGAGACGTCTTCGCGTGACAGCTTTGATGCCATG 

GCCGAGCCTGCTGCTGCGCCGCCCAGTGGTGGCCGGGCTGAACAGCGCACCGTGCAGGTTGAAGGTGCGC 

TCAAGCACACCAGTTACCTGAACCGGACCTTTACCTTTGACACCTTCGTCGAAGGTAAGTCGAACCAGCT 

CGCCCGCGCGGCTGCCTGGCAGGTTGCGGACAACCCTAAGCATGGCTACAACCCACTGTTCCTTTATGGC 

GGTGTGGGTTTGGGTAAAACCCACCTTATGCATGCTGTGGGTAACCATCTGCTGAAGAAGAATCCGAACG 

CCAAGGTGGTGTACCTGCATTCGGAGCGCTTCGTCGCGGACATGGTCAAAGCGTTGCAACTCAACGCCAT 

CAACGAATTCAAGCGCTTCTACCGCTCGGTGGACGCGTTGCTGATCGACGATATCCAGTTCTTCGCTCGC 

AAAGAGCGCTCGCAAGAAGAGTTTTTCCACACCTTCAACGCCTTGCTTGAGGGTGGCCAGCAGGTAATCC 

TTACCTCTGACCGCTATCCCAAGGAAATCGAAGGCCTGGAAGAGCGTCTGAAGTCGCGCTTTGGTTGGGG 

CCTGACGGTGGCTGTCGAGCCGCCAGAGCTGGAGACCCGCGTAGCGATCCTGATGAAGAAGGCCGACCAG 

GCCAAAGTCGAGCTCCCGCATGACGCAGCCTTTTTCATCGCTCAGCGCATCCGGTCCAACGTCCGTGAGC 

TGGAAGGTGCACTGAAGCGAGTTATTGCTCACTCGCACTTCATGGGGCGTGACATCACCATCGAGCTGAT 

TCGTGAATCGCTCAAGGATCTGTTGGCGCTGCAAGACAAACTGGTCAGTGTGGATAACATTCAGCGTACC 

GTCGCTGAGTACTACAAGATCAAGATCTCCGATCTGTTGTCCAAGCGTCGTTCGCGTTCTGTCGCGCGCC 

CGCGTCAGGTAGCCATGGCCCTGTCCAAGGAGTTGACCAACCACAGTCTGCCGGAAATCGGCGACATGTT 

CGGTGGTCGCGACCATACGACCGTGCTGCACGCCTGCCGCAAAATCAATGAACTGAAGGAATCCGACGCG 

GACATCCGCGAGGACTACAAGAACCTGCTGCGGACGCTGACGACCTGA 

Gene 2: Pseudomonas syringae pv. syringae B728a



GTGTCAGTGGAACTTTGGCAGCAGTGCGTGGAGCTTTTGCGCGATGAGCTGCCTGCCCAGCAATTCAACA 

CTTGGATCCGTCCGCTACAGGTCGAAGCCGAAGGCGACGAGTTGCGTGTGTACGCACCCAATCGTTTTGT 

TCTCGACTGGGTCAACGAAAAGTACCTTGGTCGTCTGCTCGAGCTTCTCGGCGAACACGGTCAAGGCATG 

GCCCCTGCTCTTTCCTTATTAATAGGAAGCAAGCGCAGCTCAGCACCGCGTGCTGCCCCGAATGCACCCT 

TGGCCGCTGCAGCCTCACAGGCGCTGTCTGCCAATTCGGTCAGCAGCGTCTCGGCCCCGGCTCCTGCCAC 

GGCTGCTCCAGCTGCTGCTGTAGCGACGCCTGCACCGGTTCAGAACGTTGCAACACACGACGAACCGTCG 

CGTGACAGCTTCGATCCGATGGCCGGAGCCAGCTCGCAACAAGCGCCCGCCCGCGCTGAACAACGTACCG 

TCCAGGTAGAAGGTGCGCTCAAGCACACCAGTTACCTGAACCGTACGTTCACGTTCGAAAATTTCGTCGA 

GGGTAAGTCCAACCAGCTGGCACGCGCTGCGGCCTGGCAGGTTGCCGACAACCCCAAGCATGGCTACAAC 

CCGCTGTTCCTTTATGGCGGCGTGGGTCTTGGTAAAACTCACTTGATGCATGCGGTGGGTAACCACCTGC 

TGAAGAAGAACCCGAACGCCAAGGTCGTGTACCTGCATTCGGAGCGCTTCGTTGCAGACATGGTCAAGGC 

CTTGCAGCTCAATGCAATCAACGAGTTCAAGCGCTTCTACCGTTCAGTCGATGCGCTGCTGATCGACGAC 

ATCCAGTTTTTTGCCCGCAAGGAACGTTCGCAGGAAGAGTTTTTCCACACGTTCAACGCGCTGCTGGAAG 

GCGGACAGCAGGTCATTCTGACCAGCGACCGCTATCCCAAGGAAATCGAAGGCCTTGAAGAGCGACTCAA 

ATCGCGTTTTGGCTGGGGCCTGACGGTTGCCGTCGAGCCTCCGGAGCTGGAAACCCGCGTGGCGATCCTC 

ATGAAAAAAGCAGATCAGGCCAAGGTCGATCTGCCCCATGACGCAGCGTTCTTCATCGCCCAGCGAATTC 

GCTCCAACGTCCGTGAGCTGGAAGGTGCGCTCAAGCGCGTCATCGCTCACTCGCACTTCATGGGCCGCGA 

CATCACCATCGAGCTGATTCGCGAGTCGCTGAAGGACTTGCTGGCGTTGCAGGACAAGCTGGTCAGTGTG 

GATAACATTCAGCGCACTGTCGCCGAGTACTACAAGATCAAGATTTCCGATCTGCTGTCCAAGCGTCGTT 

CCCGCTCTGTCGCCCGGCCTCGTCAGGTCGCGATGGCGCTCTCCAAGGAACTCACCAACCACAGTCTTCC 

GGAAATCGGTGACGTGTTTGGTGGCCGTGACCACACGACTGTCTTGCACGCATGCCGAAAGATCAACGAG 

CTCAAGGAATCCGATGCGGATATCCGCGAGGACTACAAGAACCTGCTGCGCACTCTGACTACGTGA 
Gene 3: Escherichia coli E24377A



GTGTCACTTTCGCTTTGGCAGCAGTGTCTTGCCCGATTGCAGGATGAGTTACCAGCCACAGAATTCAGTA 

TGTGGATACGCCCATTGCAGGCGGAACTGAGCGATAACACGCTGGCCCTGTACGCGCCAAACCGTTTTGT 

CCTCGATTGGGTACGGGACAAGTACCTTAATAATATCAATGGACTGCTAACCAGTTTCTGCGGAGCGGAT 

GCCCCACAGCTGCGTTTTGAAGTCGGCACCAAACCGGTGACGCAAACGCCACAAGCGGCAGTGACGAGCA 

ACGTCGCGGCCCCTGCACAGGTGGCGCAAACGCAGCCGCAACGTGCTGCGCCTTCTACGCGCTCAGGTTG 

GGATAACGTCCCGGCCCCGGCAGAACCGACCTATCGTTCTAACGTAAACGTCAAACACACGTTTGATAAC 

TTCGTTGAAGGTAAATCTAACCAACTGGCGCGCGCGGCGGCTCGCCAGGTGGCGGATAACCCTGGCGGTG 

CCTATAACCCGTTGTTCCTTTATGGCGGCACGGGTCTGGGTAAAACTCACCTGCTGCATGCGGTGGGTAA 

CGGCATTATGGCGCGCAAGCCGAATGCCAAAGTGGTTTATATGCACTCCGAGCGCTTTGTTCAGGACATG 

GTTAAAGCCCTGCAAAACAACGCGATCGAAGAGTTTAAACGCTACTACCGTTCCGTAGATGCACTGCTGA 

TCGACGATATTCAGTTTTTTGCTAATAAAGAACGATCTCAGGAAGAGTTTTTCCACACCTTCAACGCCCT 

GCTGGAAGGTAATCAACAGATCATTCTCACCTCGGATCGCTATCCGAAAGAGATCAACGGCGTTGAGGAT 

CGTTTGAAATCCCGCTTCGGTTGGGGACTGACTGTGGCGATCGAACCGCCAGAGCTGGAAACCCGTGTGG 

CGATCCTGATGAAAAAGGCCGACGAAAACGACATTCGTTTGCCGGGTGAAGTGGCGTTCTTTATCGCCAA 

GCGTCTACGATCTAACGTACGTGAGCTGGAAGGGGCGCTGAACCGCGTCATTGCCAACGCCAACTTTACC 

GGAAGGGCGATCACCATCGACTTCGTGCGTGAGGCGCTGCGCGACTTGCTGGCATTGCAGGAAAAACTGG 

TCACCATCGACAATATTCAGAAGACGGTGGCGGAGTACTACAAGATCAAAGTTGCGGATCTCCTTTCCAA 

GCGTCGATCCCGCTCGGTGGCGCGTCCGCGCCAGATGGCGATGGCGCTGGCGAAAGAGCTGACTAACCAC 

AGTCTGCCGGAGATTGGCGATGCGTTTGGTGGTCGTGACCACACGACGGTGCTTCATGCCTGCCGTAAGA 

TCGAGCAGTTGCGTGAAGAGAGCCACGATATCAAAGAAGATTTTTCAAATTTAATCAGAACATTGTCATC 

GTAA 

Gene 4: Erwinia carotovora subsp. atroseptica SCRI1043

GTGTCACTTTCGCTTTGGCAGCAGTGTCTTGCCCGTTTGCAGGATGAGTTACCTGCCACAGAATTCAGTA 

TGTGGATACGCCCGTTGCAGGCGGAACTGAGTGATAACACTCTGGCGCTCTACGCCCCCAATCGCTTTGT 

GCTGGATTGGGTTCGTGATAAATACTTAAATAATATCAATGTCCTGCTGAATGATTTTTGCGGGATGGAT 

GCCCCCTTACTGCGTTTTGAAGTGGGGAGTAAACCGCTGGTTCAAACCATAAGCCAGCCAGCGCAGTCGC 

ACCACAACCCTGTCAGCGTTGCACGGCAACAGCCAGTACGCATGGCACCGGTACGCCCAAGCTGGGATAA 

CTCGCCTGTACAGGCAGAGCATACCTACCGTTCCAATGTGAACCCGAAACATACGTTTGATAACTTCGTT 

GAGGGTAAATCGAACCAGTTAGCACGGGCAGCGGCACGTCAGGTGGCTGACAACCCAGGCGGCGCGTATA 

ACCCGCTGTTTCTCTATGGCGGCACTGGCTTGGGTAAAACGCACCTGTTGCATGCAGTGGGGAATGGTAT 

TATCGCCCGTAAACCCAACGCGAAGGTGGTCTACATGCACTCCGAGCGTTTCGTGCAGGATATGGTGAAG 

GCGTTGCAGAACAATGCGATTGAAGAGTTCAAACGCTACTACCGTTCTGTTGACGCACTGCTGATCGATG 

ATATTCAATTCTTCGCTAATAAAGAGCGTTCGCAGGAAGAGTTCTTTCATACCTTTAATGCACTGCTGGA 

AGGCAACCAGCAAATCATTCTGACTTCTGACCGCTACCCGAAAGAGATCAATGGTGTGGAAGATCGTCTA 

AAATCCCGCTTTGGTTGGGGGTTAACGGTCGCGATTGAACCGCCTGAGCTGGAAACCCGCGTGGCGATTC 

TGATGAAAAAGGCAGATGAAAATGACATTCGCTTGCCTGGTGAAGTCGCATTCTTTATTGCTAAACGCCT 

GCGTTCTAACGTGCGTGAGTTGGAAGGTGCATTGAACCGCGTTATTGCTAACGCCAATTTTACCGGCCGT 

TCGATCACCATTGATTTTGTGCGTGAGGCGCTGCGCGATCTGCTGGCGTTGCAGGAAAAGCTGGTTACTA 

TCGACAATATTCAAAAGACCGTGGCGGAATACTATAAAATCAAGATAGCCGACCTGCTGTCTAAACGACG 

TTCCCGCTCGGTGGCGCGTCCGCGCCAGATGGCGATGGCGTTGGCGAAAGAACTGACGAATCACAGCCTG 

CCGGAAATTGGCGATGCCTTTGGCGGGCGTGATCATACGACGGTGTTGCATGCCTGCCGCAAGATTGAGC 

AGTTGCGTGAAGAAAGCCACGACATCAAAGAAGATTTTTCCAATTTAATCAGAACACTATCGTCATAA 